Abstract

Inspired by a concept in comparative genomics, we investigate properties
of randomly chosen members of $G_1(m,n,t)$, the set of bipartite
graphs with m left vertices, n right vertices, t edges, and each vertex of
degree at least one. We give asymptotic results for the number of such
graphs and the number of (i,j) trees they contain. We compute the
thresholds for the emergence of a giant component and for the graph
to be connected.

1 Introduction

Biologists use an Oxford grid to indicate the relationship between two gen- 
omes. It is a matrix with g(i,j) = 1 if part of chromosome i in the species 
A is homologous to part of chromosome j in species B. The corresponding 
Oxford graph is the bipartite graph obtained by letting the chromosomes of 
species A be vertices on the left and chromosomes of species B be vertices on 
the right and with an edge from i on the left to j on the right if g(i,j) = 1. 
Figure 1 gives the Oxford graph for the autosomes (non-sex chromosomes)
of elephant and humans.

Let $G_1(m,n,t)$ be the set of bipartite graphs with m left vertices, n right
vertices, t edges, and each vertex of degree at least one. The graph in
Figure 1 is a member of $G_1(22,27,44)$, but is it a typical member of that
set? To answer this question we will examine properties of randomly chosen
members of $G_1(m,n,t)$ and of related families of bipartite graphs. We begin
by asking how many such graphs there are. To answer this question we will

*Work done in the summer of 2003 at a Cornell REU supported by the NSF. 
' Partially supported by NSF grants from the probability program (0202935) and from 
a joint DMS/NIGMS initiative to support research in mathematical biology (0201037). 



investigate the model $G^r(m,n,t)$: fix a vertex set L of size m and R of size
n, and pick t of the mn edges between L and R with replacement (picking
the same edge multiple times is allowed). As usual, we are interested in
the behavior of these random graphs as t, m, and n go to infinity; when
using the symbols $\sim$, $\approx$, and $\to$ we are tacitly assuming the results hold as
t, m, and n go to infinity. Standard results for the birthday problem (see
e.g. page 83 of Durrett 1995) show that the probability no edge is picked
twice is $\sim \exp(-t^2/2mn)$, which converges to a positive limit if $t/m \to \lambda$
and $t/n \to \rho$, so not much is changed by picking with replacement, except
that the next question becomes much easier to answer.
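
The birthday estimate is easy to check numerically. The following is a minimal sketch, not part of the original argument: the sizes m, n, t are hypothetical, and the function names are ours. It compares the exact probability that t draws from the mn possible edges are all distinct with the approximation exp(-t^2/2mn).

```python
import math

def prob_all_distinct(t, cells):
    """Exact probability that t uniform draws with replacement from `cells` options never repeat."""
    return math.exp(sum(math.log1p(-k / cells) for k in range(t)))

def birthday_approximation(t, cells):
    """The approximation exp(-t^2 / (2 * cells)), with cells = m * n possible edges."""
    return math.exp(-t * t / (2 * cells))

# Hypothetical sizes chosen so that t/m = 2 and t/n is about 1.63.
m, n, t = 2200, 2700, 4400
exact = prob_all_distinct(t, m * n)
approx = birthday_approximation(t, m * n)
print(f"exact = {exact:.5f}, approximation = {approx:.5f}")
```

Scaling m, n, and t by a common factor leaves t^2/2mn unchanged, so the probability of no repeated edge stays bounded away from zero, as the text asserts.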

Q. How big is $G_1^r(m,n,t)$, the subset of $G^r(m,n,t)$ in which each vertex has
degree at least one?

To relate this to the classical occupancy problem, consider an m x n array 
of boxes and throw in t balls. Let A be the event that each row has at least 
one ball and B be the event that each column has at least one ball. It is 
easy to see that (thanks to sampling with replacement) the probability of 
B is not affected by conditioning on the number of balls in each row, so A 
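and B are independent. Using the multinomial distribution

$$P(A) = \frac{t!}{m^t} \sum \frac{1}{i_1! \cdots i_m!}$$

where the sum is over all $i_1, \ldots, i_m \ge 1$ with $i_1 + \cdots + i_m = t$. To evaluate
the sum we rewrite it as

$$P(A) = \frac{t!\, e^{am}}{a^t m^t} \sum \prod_{j=1}^m e^{-a} \frac{a^{i_j}}{i_j!} = \frac{t!\, e^{am}}{a^t m^t}\, P(Z_1 + \cdots + Z_m = t,\ Z_1 \ge 1, \ldots, Z_m \ge 1)$$

where the $Z_i$ are independent Poisson with mean a.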

It is easy to see that $P(Z_1 \ge 1, \ldots, Z_m \ge 1) = (1 - e^{-a})^m$ and $E(Z_i \mid Z_i \ge
1) = a/(1 - e^{-a})$, so if we pick a so that $a/(1 - e^{-a}) = t/m$ and let $\sigma_a^2 =
\mathrm{var}(Z_i \mid Z_i \ge 1)$ then

$$P(Z_1 + \cdots + Z_m = t \mid Z_1 \ge 1, \ldots, Z_m \ge 1) \sim 1/\sqrt{2\pi \sigma_a^2 m}$$
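
As a sanity check on this calculation, the sketch below compares the exact value of P(A), obtained by inclusion-exclusion over empty rows, with the approximation that results from combining the Poisson rewriting of P(A) with the local limit estimate. The sizes m = 50, t = 100 are hypothetical, and `solve_rate` is our own bisection helper, not something from the paper.

```python
import math

def solve_rate(mean_degree):
    """Bisection for the a > 0 with a / (1 - exp(-a)) = mean_degree (needs 1 < mean_degree < 50)."""
    lo, hi = 1e-9, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean_degree:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def log_prob_exact(m, t):
    """log P(A): t balls into m equally likely rows, every row hit (exact integer arithmetic)."""
    onto = sum((-1) ** j * math.comb(m, j) * (m - j) ** t for j in range(m + 1))
    return math.log(onto) - t * math.log(m)

def log_prob_asymptotic(m, t):
    """log of t! e^{am} (1 - e^{-a})^m / (a^t m^t sqrt(2 pi sigma_a^2 m))."""
    a = solve_rate(t / m)
    q = -math.expm1(-a)                       # 1 - e^{-a}
    var = (a + a * a) / q - (a / q) ** 2      # var(Z | Z >= 1) for Z ~ Poisson(a)
    return (math.lgamma(t + 1) + a * m - t * math.log(a * m)
            + m * math.log(q) - 0.5 * math.log(2 * math.pi * var * m))

m, t = 50, 100   # hypothetical sizes with t/m = 2
gap = log_prob_exact(m, t) - log_prob_asymptotic(m, t)
print(f"log(exact) - log(asymptotic) = {gap:.5f}")
```

The difference of logarithms should be close to zero and should shrink as m grows with t/m held fixed.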

A similar analysis applies to P(B) giving the following result. 

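Theorem 1 Let $a/(1 - e^{-a}) = t/m$ and $b/(1 - e^{-b}) = t/n$ and suppose that
$t/m \to \lambda$, $t/n \to \rho$. Then

$$|G_1^r(m,n,t)| = (nm)^t P(A) P(B) \sim (t!)^2\, \frac{(e^a - 1)^m a^{-t}\, (e^b - 1)^n b^{-t}}{2\pi \sqrt{\sigma_a^2 m\, \sigma_b^2 n}}$$

where $\sigma_b^2$ is defined like $\sigma_a^2$ with b in place of a.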
As a consequence of Theorem 1 and the birthday problem result we can
calculate $|G_1(m,n,t)|$ up to a constant factor.

Corollary 1 Under the assumptions of Theorem 1,

$$e^{-\lambda\rho/2} \le \liminf \frac{|G_1(m,n,t)|}{|G_1^r(m,n,t)|} \le \limsup \frac{|G_1(m,n,t)|}{|G_1^r(m,n,t)|} \le 1$$

Even more important than allowing us to count the graphs, the proof 
of Theorem 1 allows us to relate our graphs to ones studied by Molloy and 
Reed (1995) and Newman, Strogatz, and Watts (2001). Let Y be a random 
variable with distribution given by 

$$P(Y = k) = \frac{1}{1 - e^{-a}}\, e^{-a} \frac{a^k}{k!}, \quad \text{for } k \ge 1$$

and $P(Y = k) = 0$ otherwise. We will say Y has a truncated Poisson distri-
bution with parameter a, or $\mathcal{P}(a)$ for short. This distribution is the limiting
degree distribution of a graph from $G_1^r(m,n,t)$ if the parameter a is chosen cor-
rectly. We choose a by equating the means of the two distributions. The
truncated Poisson distribution has mean $a/(1 - e^{-a})$ and the mean degree
of a left (right) vertex is t/m (t/n).

We can now define a new graph model that mimics the degree distri-
bution of vertices from $G_1^r(m,n,t)$. Label the left vertices $l_1, l_2, \ldots, l_m$ and
the right vertices $r_1, r_2, \ldots, r_n$. Let $d(l_i)$, $i = 1, \ldots, m$, be independent $\mathcal{P}(a)$
random variables where $a/(1 - e^{-a}) = t/m$; let $d(r_i)$, $i = 1, \ldots, n$, be in-
dependent $\mathcal{P}(b)$ random variables where $b/(1 - e^{-b}) = t/n$. Condition on
the sum of the $d(l_i)$ being t and condition on the sum of the $d(r_i)$ being
t. Make a set $L'$ ($R'$) with $d(l_i)$ ($d(r_i)$) copies of vertex $l_i$ ($r_i$). Pair up
elements in $L'$ with elements in $R'$ uniformly at random. Finally, collapse
the vertex copies into a single vertex and let the vertex pairings determine
the edges of the graph (which may have multiple edges between vertices).
Call the resulting random graph TP(m, n, t). It is clear that the $G_1^r(m,n,t)$
and TP(m, n, t) random graph models have the same degree distribution,
and it is not surprising that the models are, in fact, the same.
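
The construction of TP(m, n, t) can be sketched in code. This is an illustration under our own assumptions, not the authors' implementation: the conditioning on the degree sums is done by plain rejection sampling (adequate only for small m and n), the helper names are ours, and t must exceed both m and n.

```python
import math
import random

def solve_rate(mean_degree):
    """Bisection for the a > 0 with a / (1 - exp(-a)) = mean_degree (needs 1 < mean_degree < 50)."""
    lo, hi = 1e-9, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean_degree:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def truncated_poisson(a, rng):
    """Inverse-CDF draw from Poisson(a) conditioned to be at least 1."""
    u = rng.random() * -math.expm1(-a)        # uniform on [0, 1 - e^{-a})
    k, term = 1, a * math.exp(-a)             # term = P(Poisson(a) = k)
    cumulative = term
    while u > cumulative and term > 0.0:      # second test guards against rounding at the tail
        k += 1
        term *= a / k
        cumulative += term
    return k

def conditioned_degrees(count, a, total, rng):
    """Rejection sampling: `count` truncated Poisson degrees, kept only if they sum to `total`."""
    while True:
        degrees = [truncated_poisson(a, rng) for _ in range(count)]
        if sum(degrees) == total:
            return degrees

def sample_tp(m, n, t, seed=0):
    """Return the t edges (left index, right index) of one draw; repeated pairs are multiple edges."""
    rng = random.Random(seed)
    left = conditioned_degrees(m, solve_rate(t / m), t, rng)
    right = conditioned_degrees(n, solve_rate(t / n), t, rng)
    left_copies = [i for i, d in enumerate(left) for _ in range(d)]
    right_copies = [j for j, d in enumerate(right) for _ in range(d)]
    rng.shuffle(right_copies)                 # uniform pairing of the copies in L' and R'
    return list(zip(left_copies, right_copies))

edges = sample_tp(22, 27, 44)
print(len(edges), "edges;", len(set(edges)), "distinct")
```

Every left and right vertex appears in at least one edge by construction, since each degree is at least one.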

Lemma 1 The models $G_1^r(m,n,t)$ and TP(m,n,t) are the same.

We give the proof in the appendix. To study the question of the existence
of a giant component in our graph, we begin with the general case in which
the degrees of the m left vertices have distribution $p_k$ and the degrees of the
n vertices on the right have distribution $q_k$. If we examine the cluster of a
given vertex v on the left then the number of its first generation members (at
distance one from v) will have distribution $p_k$, but the number of children of a member
of the first generation will not have distribution $q_k$. A vertex on the right
with degree k is chosen in the first generation with probability proportional
to $k q_k$. If we let $\nu = \sum_k k q_k$ and $\bar q_k = (k+1) q_{k+1}/\nu$ then the number of
children of a child of v will have distribution $\bar q_k$ and mean $\bar\nu = \sum_k k \bar q_k$. Here
we have shifted the distribution by 1 to remove the edge that we arrived on
(so that v is not counted as its own grandchild). Readers who are used to
the Erdos-Renyi random graphs should note that if $q_k$ is Poisson($\lambda$), then
$\bar q_k$ is again Poisson($\lambda$).

Similar calculations apply to the third generation. The members of the
second generation have size biased degree distributions $\bar p_k = (k+1) p_{k+1}/\mu$
where $\mu = \sum_k k p_k$ and this distribution has mean $\bar\mu$. As the reader can
probably guess by analogy with branching processes,

Lemma 2 The condition for the existence of a giant component is $\bar\mu \cdot \bar\nu > 1$.

Molloy and Reed (1995), who wrote the condition in the equivalent form
$\sum_k k(k-2) p_k > 0$, proved this in the ordinary (unipartite) case, essentially
by showing that the branching process analogy gives an accurate approxi- 
mation of cluster sizes. Newman, Strogatz, and Watts (2001), motivated by 
studies of the structure of the world wide web, collaboration graphs of scien- 
tists, and Fortune 1000 company boards of directors, extended Molloy and 
Reed's results to directed and bipartite graphs. Since Newman, Strogatz, 
and Watts published in Physical Review E, they did not have to prove their 
results. Instead, like physicists, they wrote generating function equations 
that come from thinking of cluster formation as a branching process. As 
the reader can see from the description, Lemma 2 is almost a known result. 
Since we need some of the details in the proof of Theorem 4, we will give a 
detailed proof for the special case that appears in Theorem 2. 

Our next step is to see what Lemma 2 says about our example. If $p_k$ is
$\mathcal{P}(a)$ then $\mu = a/(1 - e^{-a})$ so

$$\bar p_k = \frac{1 - e^{-a}}{a}\, (k+1) \cdot \frac{e^{-a}}{1 - e^{-a}} \frac{a^{k+1}}{(k+1)!} = e^{-a} \frac{a^k}{k!} \qquad (1)$$

i.e., the Poisson distribution with mean a. A similar calculation shows $\bar q_k$ is
the Poisson distribution with mean b, so the condition for the existence of
a giant component is ab > 1.
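
The claim that size biasing and shifting a truncated Poisson distribution returns an ordinary Poisson distribution can be verified term by term. A small sketch, with function names of our choosing and a = 1.071 taken from the human-elephant example:

```python
import math

def truncated_poisson_pmf(k, a):
    """P(Y = k) for Poisson(a) conditioned on Y >= 1."""
    if k < 1:
        return 0.0
    return math.exp(-a) * a ** k / (math.factorial(k) * -math.expm1(-a))

def size_biased_shifted(k, a):
    """(k + 1) p_{k+1} / mu, where p is the truncated Poisson pmf and mu its mean."""
    mu = a / -math.expm1(-a)
    return (k + 1) * truncated_poisson_pmf(k + 1, a) / mu

def poisson_pmf(k, a):
    return math.exp(-a) * a ** k / math.factorial(k)

a = 1.071
differences = [abs(size_biased_shifted(k, a) - poisson_pmf(k, a)) for k in range(10)]
print(max(differences))
```

The largest difference is at the level of floating point rounding, which matches the algebra: the factor 1 - e^{-a} in the mean cancels the normalizing constant of the truncation.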

To compute the survival probability of the branching process, let $\varphi_1$, $\varphi_2$,
$\psi_1$, and $\psi_2$ be the generating functions of $p_k$, $q_k$, $\bar p_k$, and $\bar q_k$ respectively.
Consider our branching process, starting from one vertex on the left and
conditioned on having one individual in the first generation. We call this
the homogeneous branching process, because the different distribution at
the first step has been eliminated. The number of offspring this individual
has in the third generation has generating function $\psi_2(\psi_1(z))$. To check the
order of the composition note that if N has distribution $\bar q_k$ (N is the number
of vertices in the second generation) and $X_1, X_2, \ldots$ are independent with
distribution $\bar p_k$ ($X_1$ is the number of children of a second generation vertex)
then

$$E(z^{X_1 + \cdots + X_N}) = \sum_{k=0}^{\infty} P(N = k)\, \psi_1(z)^k = \psi_2(\psi_1(z)) \qquad (2)$$

Let $\zeta_R$ be the smallest solution of $\psi_2(\psi_1(\zeta)) = \zeta$ in [0, 1], i.e., the extinc-
tion probability of the homogeneous branching process. By considering the
number of individuals in the first generation, it follows that the extinction
probability for the branching process starting with one individual on the left

is

$$\xi_L = \varphi_1(\zeta_R)$$

We define $\zeta_L$ and $\xi_R$ similarly.
Theorem 2 Let $a/(1 - e^{-a}) = t/m$ and $b/(1 - e^{-b}) = t/n$ and suppose that
$t/m \to \lambda$, $t/n \to \rho$. When ab < 1 the largest cluster is $O(\log(m+n))$. A
giant component appears when ab > 1. The fractions of vertices it contains
on the left and right are asymptotically $1 - \xi_L$ and $1 - \xi_R$. The second largest
component is $O(\log(m+n))$.
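
For the truncated Poisson case the extinction probabilities can be computed by iterating the generating functions from 0, which converges monotonically to the smallest fixed point. The sketch below is ours and rests on two stated assumptions: the size biased distributions are Poisson(a) and Poisson(b), and the extinction probability from a left vertex is the left degree generating function evaluated at the fixed point for right vertices (the branching process reasoning described above).

```python
import math

def extinction_probabilities(a, b, iterations=10_000):
    """Return (xi_left, xi_right): extinction probabilities starting from a left or a right vertex."""
    psi1 = lambda z: math.exp(a * (z - 1.0))            # generating function of Poisson(a)
    psi2 = lambda z: math.exp(b * (z - 1.0))            # generating function of Poisson(b)
    phi1 = lambda z: math.expm1(a * z) / math.expm1(a)  # truncated Poisson(a)
    phi2 = lambda z: math.expm1(b * z) / math.expm1(b)  # truncated Poisson(b)
    zeta_right = zeta_left = 0.0
    for _ in range(iterations):
        zeta_right = psi2(psi1(zeta_right))             # smallest root of psi2(psi1(z)) = z
        zeta_left = psi1(psi2(zeta_left))               # smallest root of psi1(psi2(z)) = z
    return phi1(zeta_right), phi2(zeta_left)

# Parameters quoted in the text for two of the genome comparisons.
super_left, super_right = extinction_probabilities(1.071, 1.593)   # ab = 1.707 > 1
sub_left, sub_right = extinction_probabilities(0.503, 0.605)       # ab = 0.305 < 1
print(1 - super_left, 1 - super_right, 1 - sub_left, 1 - sub_right)
```

In the supercritical example both survival probabilities are strictly positive, while in the subcritical example they vanish, in line with the threshold ab > 1.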

To illustrate the phase transition we will consider some examples. In the 
human elephant comparison in Figure 1, a = 1.071 and b = 1.593 so ab = 
1.707. With a total of 49 vertices, it is hard to recognize a giant component, 
but there is one component with 13 human and 19 elephant vertices. Figure 
2 gives a comparison of human and colobine monkey, one of our fairly close 
primate relatives, which has a = 0.503, b = 0.605 and ab = 0.305. In
agreement with the subcritical designation, there are 12 components with 2
vertices, three with 3 vertices, one with 4, and one with 6. Figure 3 gives a
comparison of the human and cat genomes that has a = 1.151, b = 0.802,
and ab = 0.925. Figure 4 compares humans and dogs, an example with
a = 2.873, b = 1.477, and ab = 4.245. The drastic difference in the graphs
in Figures 3 and 4 is somewhat surprising since the evolutionary distances
from humans to cats and to dogs are the same. In the human-dog graph there



is one giant component and three components of size 2. To lead into our 
next topic we ask: Does the number of small components in these random 
graphs agree with what we expect? 
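
The parameters in these examples come from solving a/(1 - e^{-a}) = t/m and b/(1 - e^{-b}) = t/n. As an illustration, the sketch below does this by bisection for the sizes (22, 27, 44) given earlier for the human-elephant graph; `solve_rate` is our own helper, and which side is called left only swaps the two values.

```python
import math

def solve_rate(mean_degree):
    """Bisection for the a > 0 with a / (1 - exp(-a)) = mean_degree (needs 1 < mean_degree < 50)."""
    lo, hi = 1e-9, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean_degree:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

m, n, t = 22, 27, 44
rate_m = solve_rate(t / m)     # parameter for the side with 22 vertices
rate_n = solve_rate(t / n)     # parameter for the side with 27 vertices
product = rate_m * rate_n
print(f"{rate_m:.3f} {rate_n:.3f} {product:.3f}")
```

The two rates come out near 1.59 and 1.07, and their product is close to the value ab = 1.707 quoted above.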

To get prepared for our next result, which will help us answer this ques- 
tion, we will give a second derivation of the threshold that is easy to believe 
but difficult to make rigorous. Suppose we are interested in some property 
of $G_1(m,n,t)$. Define a and b by $a/(1 - e^{-a}) = t/m$ and $b/(1 - e^{-b}) = t/n$.
Let G(M, N, p) be the random bipartite graph in which there are M = t/a
vertices on the left, N = t/b on the right, and edges are independently cho-
sen with probability p = ab/t = a/N = b/M. M and N are defined this
way so that after removing isolated vertices from each side we get a graph
similar to one from $G_1(m,n,t)$. The calculation is not difficult: the number
of non-isolated vertices on the left, $\mathcal{M}$, has expected value

$$E\mathcal{M} = M(1 - (1-p)^N) \approx M(1 - e^{-a}) = \frac{t}{a}(1 - e^{-a}) = m,$$

the number of non-isolated vertices on the right has $E\mathcal{N} = n$, and the
number of edges, $\mathcal{E}$, has expected value $E\mathcal{E} = MNp = Nb = t$. All of
the graphs in $G_1(m,n,t)$ have the same probability under G(M, N, p), which
gives the following.

Lemma 3 The distribution of G(M, N, p) conditioned on $\mathcal{M} = m$, $\mathcal{N} = n$,
$\mathcal{E} = t$ is that of $G_1(m,n,t)$.
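
The choice of M, N, and p can be checked numerically. The following sketch uses hypothetical sizes and our own helper `solve_rate`; it evaluates the expected numbers of non-isolated vertices and of edges in G(M, N, p) and compares them with m, n, and t.

```python
import math

def solve_rate(mean_degree):
    """Bisection for the a > 0 with a / (1 - exp(-a)) = mean_degree (needs 1 < mean_degree < 50)."""
    lo, hi = 1e-9, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean_degree:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

m, n, t = 2200, 2700, 4400                  # hypothetical target sizes
a, b = solve_rate(t / m), solve_rate(t / n)
M, N = t / a, t / b                         # treated as real numbers in this sketch
p = a * b / t
expected_left = M * (1 - (1 - p) ** N)      # non-isolated left vertices
expected_right = N * (1 - (1 - p) ** M)     # non-isolated right vertices
expected_edges = M * N * p
print(expected_left, expected_right, expected_edges)
```

The expected edge count equals t exactly, while the two vertex counts match m and n up to the error in replacing (1 - p)^N by e^{-a}.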

It is easy to show that when $t/m \to \lambda$ and $t/n \to \rho$, $\mathcal{M}$, $\mathcal{N}$, and $\mathcal{E}$ will
with high probability differ from their expected values by o(n). It is intu-
itively clear, but seems hard to show, that the vector $(\mathcal{M}, \mathcal{N}, \mathcal{E})$ satisfies the
local central limit theorem, so the conditioning $\mathcal{M} = m$, $\mathcal{N} = n$, $\mathcal{E} = t$ has
probability $O(1/n^{3/2})$ and any property of G(M, N, p) that has asymptotic
probability $1 - o(n^{-3/2})$ will be inherited by $G_1(m,n,t)$. Once one believes
this, the threshold result follows easily. G(M, N, p) has a giant component
if

$$1 < Mp \cdot Np = \frac{t}{a} \cdot \frac{t}{b} \cdot \left(\frac{ab}{t}\right)^2 = ab$$

For a new example, consider the number of (i,j) trees in the random 
graph, i.e., the number of trees with i vertices on the left and j vertices 
on the right. We let the tree size stay fixed while taking m, n, t to infinity. 
Once one knows that the number of labeled bipartite (i,j) trees is $i^{j-1} j^{i-1}$
(see e.g., Saltykov 1995), the expected number of (i,j) trees in G(M,N,p)
can be derived by a calculation analogous to the standard one for trees in a
unipartite random graph (see Bollobas (2001) Theorem 5.5). The result is

$$\sim t\, \frac{i^{j-1} j^{i-1}}{i!\, j!}\, \frac{(e^{-b}a)^j (e^{-a}b)^i}{ab}$$

Based on the reasoning above we expect that the corresponding result
will hold for $G_1(m,n,t)$.

Theorem 3 In $G_1(m,n,t)$, the expected number of (i,j) trees is

$$\sim t\, \frac{i^{j-1} j^{i-1}}{i!\, j!}\, \frac{(e^{-b}a)^j (e^{-a}b)^i}{ab}$$

Since the existence of (i,j) trees on disjoint sets of vertices are asymp- 
totically independent, we expect that the number of such trees will have 
asymptotically a Poisson distribution, but we have not tried to prove that. 
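
To make the tree count concrete, here is a small sketch that evaluates the expected number of (i,j) trees. It assumes the asymptotic has the form t i^{j-1} j^{i-1} (e^{-b}a)^j (e^{-a}b)^i / (i! j! ab), which is our reading of Theorem 3; the function name is ours. The parameters a = 2.873, b = 1.477 are those quoted for the human-dog graph, and t = 67 is taken from the set $G_1^r(22, 38, 67)$ mentioned for that example.

```python
import math

def expected_trees(i, j, a, b, t):
    """Assumed asymptotic expected number of (i, j) trees (i left vertices, j right vertices)."""
    labeled_trees = i ** (j - 1) * j ** (i - 1)      # spanning trees of the complete bipartite graph
    weight = (math.exp(-b) * a) ** j * (math.exp(-a) * b) ** i / (a * b)
    return t * labeled_trees * weight / (math.factorial(i) * math.factorial(j))

a, b, t = 2.873, 1.477, 67
single_edges = expected_trees(1, 1, a, b, t)
print(single_edges, expected_trees(2, 1, a, b, t), expected_trees(1, 2, a, b, t))
```

For (1,1) trees the formula reduces to t e^{-a-b}, which for these parameters is about 0.86.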

To see what Theorem 3 says, we will consider our four previous examples 
and a comparison of the human and lemur genomes given in Figure 5, which 
is somewhat surprising since this example has ab = 1.771 but no (1,1) or (2,1) 
trees. Table 1 compares the expected and observed number of (1,1), (2,1) 
and (1,2) trees. In general, there is good agreement between the observed 
and expected values. Two notable exceptions are the number of (1,1) trees 
in examples 4 and 5 where the expected values are 0.86 and 2.63 while the 
observed values are 3 and 0. If we assume that the number of trees has
a Poisson distribution then the probability of three or more (1,1) trees in
$G_1^r(22, 38, 67)$ is 0.097, while the probability of no (1,1) tree in $G_1^r(20, 22, 38)$
is 0.072.

Our final problem is to determine when the graph will be connected. 
For the Erdos-Renyi unipartite random graph G(N,p) in which there are N 
vertices and edges are independently present with probability p, the transi-
tion to connectivity occurs when $p \sim (\log N)/N$. To see this we note that
the number of edges incident to a vertex is asymptotically Poisson(Np). If we
let $p = c(\log N)/N$, the probability that a given vertex is isolated is $\sim 1/N^c$, so the
expected number of isolated vertices is large when c < 1 and goes to 0 if c > 1.
Isolated vertices prevent connectivity, so a second moment calculation shows that if c < 1
the probability of connectivity goes to 0.
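
A quick numerical illustration of this heuristic, with a function name and values of N and c chosen by us: the expected number of isolated vertices in G(N, p) is N(1 - p)^{N-1}, which behaves like N^{1-c} when p = c(log N)/N.

```python
import math

def expected_isolated(N, c):
    """Expected number of isolated vertices in G(N, p) with p = c log(N) / N."""
    p = c * math.log(N) / N
    return N * math.exp((N - 1) * math.log1p(-p))

N = 10 ** 6
below = expected_isolated(N, 0.5)    # c < 1: many isolated vertices
above = expected_isolated(N, 1.5)    # c > 1: isolated vertices disappear
print(below, N ** 0.5, above, N ** -0.5)
```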

The result in the other direction is more difficult, since one must consider 
all of the ways in which the graph can fail to be connected. A simple 
calculation (see Bollobas 2001, p. 104) shows that if $p = \theta/N$ and $\theta =
o(N^{1/2})$ then the expected number of trees with v vertices, $T_v$, has

$$E_p(T_v) \sim N\, \frac{v^{v-2}}{v!}\, \theta^{v-1} e^{-v\theta}$$



From this we see that if $\theta = c \log N$ and 1/2 < c < 1 then asymptotically
there are isolated vertices, but no trees of size $v \ge 2$. Bollobas (2001), see
Section 7.1, combines this estimate with the fact that the largest tree in a
supercritical random graph has $O(\log n)$ vertices to prove (see Theorem 7.3
on page 164) that if $\theta = \log N + x + o(1)$ then the probability G(N,p) is
connected approaches $\exp(-e^{-x})$.

Saltykov (1995) has considered a question closely related to the connec- 
tivity problem for the random bipartite graph G(M, N, T) in which there 
are M vertices on the left, N vertices on the right, and T edges. Suppose 
$M \ge N$. Let $\alpha = M/N$ and $\beta = (1 - 1/\alpha) \log N$. His main result asserts
that if

$$(1 + 1/\alpha)\, T = (M + N)\{\log(M + N) + x + o(1)\}$$

then the number of isolated vertices has asymptotically a Poisson distribu-
tion with mean

$$\lambda = \frac{e^{-x}(1 + e^{-\beta})}{1 + 1/\alpha}$$

Recalling $\alpha = M/N$, we see that the transition to connectedness occurs
when $T \sim M \log(M + N)$.

The corresponding result for our bipartite random graphs is 

Theorem 4 Define c by $t = c\, \frac{mn}{m+n} \log(m+n)$ and suppose $m/n \to \alpha$, a
positive finite limit. The probability $G_1(m,n,t)$ is connected tends to 0 or 1
depending on whether c has a limit < 1 or > 1.

Note that our threshold is asymptotically $\frac{1}{1+\alpha}\, m \log(m+n)$. The difference
in thresholds should not be surprising given the results for $E_p(T_v)$ cited
above. Our threshold is for the disappearance of (1,1) trees rather than the
absence of isolated vertices, so this occurs at a smaller value of t.
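
The role of (1,1) trees at the threshold can be illustrated numerically. The sketch below assumes the (1,1) case of the tree count takes the form t e^{-a-b} and takes t = c (mn/(m+n)) log(m+n); the sizes and helper names are hypothetical. For c < 1 the expected number of isolated edges grows with the graph, for c > 1 it shrinks.

```python
import math

def solve_rate(mean_degree):
    """Bisection for the a > 0 with a / (1 - exp(-a)) = mean_degree (needs 1 < mean_degree < 50)."""
    lo, hi = 1e-9, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean_degree:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def expected_isolated_edges(m, n, c):
    """Assumed expected number of (1,1) trees, t * exp(-a - b), at t = c * mn/(m+n) * log(m+n)."""
    t = c * m * n / (m + n) * math.log(m + n)
    a, b = solve_rate(t / m), solve_rate(t / n)
    return t * math.exp(-a - b)

small, large = 10 ** 4, 10 ** 6
below = [expected_isolated_edges(size, size, 0.8) for size in (small, large)]
above = [expected_isolated_edges(size, size, 1.2) for size in (small, large)]
print(below, above)
```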

The remainder of the paper is devoted to proofs. We take the results in 
the same order as in the introduction. 

2 Proof of Corollary 1 

Corollary 1 Under the assumptions of Theorem 1,

$$e^{-\lambda\rho/2} \le \liminf \frac{|G_1(m,n,t)|}{|G_1^r(m,n,t)|} \le \limsup \frac{|G_1(m,n,t)|}{|G_1^r(m,n,t)|} \le 1$$

Proof. The inequality $|G_1(m,n,t)| \le |G_1^r(m,n,t)|$ is trivial and proves the
result for lim sup. To prove the other result let $E = A \cap B$ be the event that
there are no isolated vertices and let F be the event that all edges chosen are
distinct. Let P denote probabilities under $G^r(m,n,t)$. From the thought
experiment of sampling with replacement until we have t distinct edges it is
clear that $P(E \mid F) \ge P(E)$ because if a graph has no isolated vertices after
the first t edges are chosen, it will have no isolated vertices when t distinct
edges are chosen. From this we get

$$\frac{|G_1(m,n,t)|}{|G_1^r(m,n,t)|} = \frac{P(E \cap F)}{P(E)} = \frac{P(E \mid F)\, P(F)}{P(E)} \ge P(F)$$

The result for liminf now follows from the result for the birthday problem 
cited in the introduction, which gives the limiting behavior of P(F). ■ 



3 Proof of Theorem 2 

Theorem 2 Let $a/(1 - e^{-a}) = t/m$ and $b/(1 - e^{-b}) = t/n$ and suppose that
$t/m \to \lambda$, $t/n \to \rho$. When ab < 1 the largest cluster is $O(\log(m+n))$. A
giant component appears when ab > 1. The fractions of vertices it contains
on the left and right are $1 - \xi_L$ and $1 - \xi_R$. The second largest component
is $O(\log(m+n))$.

The first step is to make the connection between the cluster size and the 
total progeny in a branching process. To do this, we note that instead of 
making all of the choices in pairing the duplicated left and right vertices at 
once, we can do them sequentially. Suppose that we start with vertex $l_1$.
We then choose $d(l_1)$ times without replacement from the duplicated set of
right vertices $R'$. Let $f_1(r_j)$ be the number of times vertex $r_j$ is chosen and
let $J_1 = \{j : f_1(r_j) > 0\}$. For each $j \in J_1$, choose $d(r_j) - f_1(r_j)$ times
without replacement from the duplicated set of left vertices $L'$ minus the
$d(l_1)$ copies of $l_1$. Let $f_2(l_j)$ be the number of times vertex $l_j$ is chosen, let
$J_2 = \{j : f_2(l_j) > 0\}$, etc. We continue this procedure until the cluster
containing $l_1$ has been constructed. We then choose some vertex not in the
cluster containing $l_1$, generate its cluster, and continue until the random
graph has been constructed.

From the construction it should be clear that if $Y_k^m = |J_k|$ is the number
of vertices in generation k (of a graph from TP(m,n,t)) then as $m \to \infty$,
$\{Y_k^m, k \ge 1\}$ converges to the branching process described in the introduc-
tion. There are two differences between the growing cluster and the limiting
branching process. The first is that the possible choices are dictated by the
empirical sequence of degrees $d(l_1), \ldots, d(l_m)$ and $d(r_1), \ldots, d(r_n)$ rather than
the truncated Poisson distributions. The second is that the set of available
degrees changes as choices are made.

The first difference disappears as $m \to \infty$ since by the law of large
numbers, the empirical distribution of degrees converges to the underlying
theoretical distribution. To estimate the effect of the second, let $r_k$ be a
probability distribution on the positive integers, let $\eta > 0$, and let $W(\omega)$ be
a nondecreasing function of $\omega \in (0,1)$ so that the Lebesgue measure $|\{\omega :
W(\omega) = k\}| = r_k$. We say that W is the mass function of distribution r. If
we remove an amount of mass $\eta$ from the distribution and renormalize to get
a probability distribution, then the result will be larger in distribution than
$U = (W(\omega) \mid \omega < 1 - \eta)$ and smaller in distribution than $V = (W(\omega) \mid \omega > \eta)$.
Note that $EV \le EW/(1 - \eta)$.

Subcritical Case. Suppose ab < 1. Pick $\eta > 0$ so that $ab/(1 - \eta) < 1$.
Let $\hat p_k^m$ and $\hat q_k^m$ be the empirical distributions of the degrees of vertices on
the left and on the right, let $\hat\mu_m$ and $\hat\nu_m$ be the means of these empirical
distributions, and let $\bar\mu_m = \sum_k (k-1) k\, \hat p_k^m / \hat\mu_m$ and $\bar\nu_m = \sum_k (k-1) k\, \hat q_k^m / \hat\nu_m$ be the
means of the size biased distributions. Since $p_k$ and $q_k$ have finite second
moments it follows from the law of large numbers and (1) (pg. 4) that
$\bar\mu_m \to a$ and $\bar\nu_m \to b$.

From the choice of $\eta$ it follows that if m is large then until a fraction $\eta$ of
vertices have been used up on either side, the growing cluster is dominated 
by a subcritical branching process. To estimate the growth of the cluster, 
we take the approach of Molloy and Reed (1995) and expose the cluster of 
right vertices one at a time, i.e., we pick one of the current set of active 
right vertices and go through two generations to identify the right vertices 
connected to it. The chosen right vertex is removed from the set of active 
vertices and the new ones are added; we call this a step. Vertices in early 
generations need not be exposed before vertices in later generations, as de- 
scribed at the beginning of the section; any active vertex may be exposed 
at each step. 

To prove the lower bound on the critical value, we will show that if
ab < 1 then for large m the largest cluster is $O(\log(m+n))$. Pick a right
vertex at random and let X be $-1$ plus the number of right vertices that
can be reached in two steps in the branching process ($E z^{X+1} = \psi_2(\psi_1(z))$).
Assuming cluster growth is a branching process, this represents the change
in the size of the set of active right vertices in one step of the construction.
Let $S_\ell = S_0 + X_1 + \cdots + X_\ell$, where the $X_i$ are independent with distribution X.
When $S_0 = 1$, $S_\ell$ gives the size of the active set of vertices after $\ell$ vertices in
the cluster have been exposed. The random variable $\tau = \inf\{\ell : S_\ell = 0\}$ has
the same distribution as the total progeny of the homogeneous branching
process starting from one right vertex.

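In the limiting branching process $\kappa(\theta) = E e^{\theta X} < \infty$ for all $\theta$. Since
$\kappa(0) = 1$ and $\kappa'(0) = EX < 0$ in the subcritical case, there is a $\theta > 0$ so
that $\kappa(\theta) < 1$. Therefore

$$P(\tau > k) \le P(S_k \ge 1) \le E e^{\theta (S_k - 1)} = \kappa(\theta)^k \qquad (3)$$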

so we have a bound on the total number of individuals in the branching
process. To extend the last result to the growing cluster, we begin by ob-
serving that if $\hat X$ is the corresponding quantity for the empirical distribution
then the strong law of large numbers implies $E\exp(\theta \hat X) \to E\exp(\theta X)$. If
$X^\eta$ is the distribution that dominates choices made at any time before a
fraction $\eta$ of the vertices have been used on the left or the right, then (from
the discussions earlier) $E\exp(\theta X^\eta) \le E\exp(\theta \hat X)/(1 - \eta)$. So if m is large
and $\eta$ is small $E\exp(\theta X^\eta) < 1$. It follows from (3) that there is a $\gamma > 0$ so
that $P(\tau > k) \le e^{-\gamma k}$. If we take $k = (2/\gamma) \log n$ then $P(\tau > k) \le n^{-2}$.
This and the corresponding argument for left vertices proves that the largest
cluster is $O(\log(m+n))$.
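
The exponential bound can be made concrete for the limiting branching process. In the sketch below (ours, not the authors'), X + 1 is assumed to have generating function exp(b(exp(a(z - 1)) - 1)), the Poisson case of $\psi_2(\psi_1(z))$, with the subcritical parameters a = 0.503, b = 0.605 from the colobine example. A grid search finds a $\theta$ with $\kappa(\theta) < 1$ and the resulting k with $\kappa(\theta)^k \le n^{-2}$; the value of n and the grid are arbitrary choices.

```python
import math

def kappa(theta, a, b):
    """E exp(theta * X), where X + 1 has generating function exp(b * (exp(a * (z - 1)) - 1))."""
    z = math.exp(theta)
    return math.exp(b * math.expm1(a * (z - 1.0))) / z

a, b, n = 0.503, 0.605, 10 ** 6
thetas = [step / 1000 for step in range(1, 1501)]        # grid on (0, 1.5]
best_theta = min(thetas, key=lambda th: kappa(th, a, b))
gamma = -math.log(kappa(best_theta, a, b))               # decay rate in P(tau > k) <= exp(-gamma k)
k = math.ceil(2 * math.log(n) / gamma)
bound = kappa(best_theta, a, b) ** k
print(best_theta, gamma, k, bound)
```

The cluster size k needed for the bound grows only logarithmically in n, which is the content of the subcritical estimate.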

Supercritical Case. Given distributions d and $\tilde d$, $\|d - \tilde d\| = (1/2) \sum_k |d_k - \tilde d_k|$
is the total variation distance. If m is large and the fraction of vertices
chosen on either side is at most $\eta$, then the cluster growth process dominates
a branching process with offspring distributions $\tilde p_k$ and $\tilde q_k$ with $\|\tilde p - \bar p\| \le 2\eta$
and $\|\tilde q - \bar q\| \le 2\eta$ where $\bar p$ and $\bar q$ are the size biased degree distributions. Let
$W_p$ be the mass function of $\bar p$. Among all distributions $\tilde p$ with $\|\tilde p - \bar p\| \le 2\eta$,
the smallest one, $p^\eta$, is the distribution with mass function $W_p^\eta$, $W_p^\eta(\omega) =
W_p(\omega - 2\eta)$, $\omega \in (2\eta, 1]$ and $W_p^\eta(\omega) = 0$, $\omega \in (0, 2\eta]$. Define $W_q$, $W_q^\eta$, and
$q^\eta$ in the analogous way.

If we let $a_\eta$ and $b_\eta$ be the means of $p^\eta$ and $q^\eta$ then the dominated
convergence theorem implies that as $\eta \to 0$, we have $a_\eta \to a$ and $b_\eta \to b$, so
$a_\eta b_\eta > 1$ for small $\eta$. Now if $0 \le z \le 1$ we have

$$\left| \sum_k p^\eta_k z^k - \sum_k \bar p_k z^k \right| \le \sum_k |p^\eta_k - \bar p_k| \le 4\eta \to 0$$

From this we see that if $\psi_1^\eta$ and $\psi_2^\eta$ are generating functions of $p^\eta$ and $q^\eta$
then, uniformly on [0, 1], we have $\psi_1^\eta \to \psi_1$, $\psi_2^\eta \to \psi_2$, and $\psi_2^\eta(\psi_1^\eta) \to \psi_2(\psi_1)$.
This uniform convergence implies that the smallest fixed point of $\psi_2^\eta(\psi_1^\eta)$
converges to that of $\psi_2(\psi_1)$, i.e., the extinction probability $\zeta_R^\eta \to \zeta_R$ as
$\eta \to 0$. In a similar way we can conclude $\zeta_L^\eta \to \zeta_L$, $\xi_L^\eta \to \xi_L$, and $\xi_R^\eta \to \xi_R$.
To study the size of clusters, as in the previous proof, we expose them
one right vertex at a time. When we expose the grandchildren of an active 
vertex, one of them might already be in the active set. We call such an 
event a collision. If a collision occurs, instead of adding the grandchild 
to the active set (as is usually done), we remove it from the active set. To 
show that this does not slow down the branching process too much, we must 
bound the number of collisions. When we look at the left vertex children of 
a right vertex, we cannot encounter one we have seen before, because the 
first time a left vertex is visited, all of its other right vertex neighbors are 
added to the active set and all collisions are removed. Note that $p^\eta$ and $q^\eta$
are concentrated on $\{0, \ldots, L\}$ where $L = \max\{W_p^\eta(1), W_q^\eta(1)\}$. Thus until
$\delta n$ vertices have been exposed on the right, the number of edges with an
end in the active set is at most $\delta n L$. The probability of picking one of these
edges in the exposure of an active vertex is at most $\delta n L^2/(t - \delta n L^2) = \gamma$.

Let Z be the number of grandchildren in the branching process in which
the first generation is according to $q^\eta$ and the second according to $p^\eta$. Let
Y be the distribution of grandchildren in the branching process modified to
correct for collisions; $Y = Z - 2 \cdot \mathrm{Binomial}(\gamma, Z)$. Therefore if $\delta$ is small,
$EY = a_\eta b_\eta (1 - 2\gamma) > 1$.

Let X = Y - 1 and define $S_\ell$ as before. Since EX > 0 the random walk
has positive probability of not hitting 0, so there is positive probability that
the cluster growth persists until there are at least $\delta m$ left vertices or $\delta n$
right vertices. To prove that we will get at least one such cluster with high
probability, it is enough to show that with high probability all unsuccessful
attempts will use up at most $O(\log(m+n))$ vertices. For this guarantees
that we will get a large number of independent trials before using a fraction
$\delta/2$ of vertices on either side.

The random variable X is bounded so $\kappa(\theta) = E e^{\theta X} < \infty$ for all $\theta$. $\kappa(\theta)$ is
convex, continuous and has $\kappa'(0) = EX > 0$, $\kappa(\theta) \sim P(X = -1) e^{-\theta} \to +\infty$
as $\theta \to -\infty$, so there is a unique $\lambda > 0$ so that $\kappa(-\lambda) = 1$. In this
case $\exp(-\lambda S_k)$ is a nonnegative martingale. Due to the possible removal
of active vertices, the random walk may jump down by more than 1, but
its jumps are bounded so the optional stopping theorem implies that the
probability of reaching 0 from $S_0 = x$ is $\le e^{-\lambda x}$.

The last estimate implies that the probability that the set of active
vertices grows to size $(2/\lambda) \log n$ without generating a large cluster is $\le n^{-2}$.
Routine large deviations estimates for sums of independent random variables
show that if C is large, the probability that the sum of $C \log n$ independent
copies of X is $< (2/\lambda) \log n$ is at most $n^{-2}$. Thus the probability of exposing
more than $C \log n$ vertices and not generating a large cluster is $\le 2n^{-2}$.
Combining this with the estimate for left clusters, we have our bound on
unsuccessful attempts and can conclude that with high probability there is
a large cluster.

To finish up now, let $\varepsilon = \delta/L^2$. Since the maximum degree of any vertex
is L, we can expose $\varepsilon n$ right vertices without using up $\delta n$ vertices on either
side. A routine large deviations estimate shows that

$$P(S_{\varepsilon n} < (\varepsilon n) EX/2) \le C e^{-cn}$$

Consider now two vertices i and j. If their clusters reach size $C \log n$ then the
probability one of them will fail to continue until $\varepsilon n$ right vertices have been
exposed is $\le 4n^{-2}$. If the number of right vertices of their clusters reach size
$\varepsilon n$ and they have not already intersected, then with probability $\ge 1 - 2Ce^{-cn}$
each has an active set of size $\ge (\varepsilon n) EX/2$. The probability they will fail to
intersect on the next step is exponentially small. With probability tending to
1, all vertices in clusters larger than $C \log n$ belong to the giant component,
and therefore the second largest component is $O(\log(m+n))$.

Our final task is to prove the claim about the fraction of vertices on
the left and right that belong to the giant component. Previous arguments
have shown that if $\delta$ is small, the extinction probability for the comparison
branching processes are $\approx \xi_L$. We have shown that membership in the
giant component is essentially the same as belonging to a component of size
$> C \log n$. Now, the probability of a collision before reaching size $C \log n$ is
at most

(Clogn) 2 -— (4) 

n 

so if $1_{i \in G}$ is the indicator function that left vertex i is part of a component
of size $> C \log n$ then $E(1_{i \in G}) \approx 1 - \xi_L$. When two clusters do not intersect,
their growth is independent so (4) implies that



var (^2 Itec?) < m 2 (Clogn] 



2 



i=i 



n 



Chebyshev's inequality implies

$$\frac{1}{m} \sum_{i=1}^{m} \left( 1_{i \in G} - P(i \in G) \right) \to 0$$

in probability and the desired result follows.
4 Proof of Theorem 3

Theorem 3 In $G_1(m,n,t)$, the expected number of (i,j) trees is

$$\sim t\, \frac{i^{j-1} j^{i-1}}{i!\, j!}\, \frac{(e^{-b}a)^j (e^{-a}b)^i}{ab}$$

Proof. Let T be a fixed vertex labeled (i,j) tree (left vertex labels are some 
subset of {1, 2, . . . , m} of size i), let k = \E(T)\ = i + j — 1, and let D be the 
event that it exists as a component of our random graph. Let C(m, n, t) be 
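the number of edge-labeled multigraphs belonging to $G_1^r(m,n,t)$.

$$P(D) = \binom{t}{k} k!\, \frac{C(m-i, n-j, t-k)}{C(m,n,t)}$$

The $\binom{t}{k} k!$ term comes from all the ways of labeling the edges of the tree
and dividing the labels between tree and non-tree edges. From Lemma 1 we
know

$$C(m,n,t) = \sum_{a} \frac{t!}{a_1! \cdots a_m!} \cdot \sum_{b} \frac{t!}{b_1! \cdots b_n!}$$

where the sums are over degree sequences $a_1, \ldots, a_m \ge 1$ and $b_1, \ldots, b_n \ge 1$
that add up to t.
By symmetry it suffices to study the m part of the equation. From the proof
of Theorem 1, we have

$$\sum_{a} \frac{t!}{a_1!\, a_2! \cdots a_m!} \sim \frac{(e^a - 1)^m\, t!}{a^t \sqrt{2\pi \sigma_a^2 m}}$$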



Thus C(m - i, n - j, t - k)/C(m, n, t) is the product of two symmetric terms;
the one containing m is

$$\frac{(e^{a'} - 1)^{m-i}\, (t-k)!}{\sqrt{2\pi \sigma_{a'}^2 (m-i)}\; a'^{\,t-k}} \cdot \frac{\sqrt{2\pi \sigma_a^2 m}\; a^t}{(e^a - 1)^m\, t!} \qquad (5)$$

where a' is determined by $a'/(1 - e^{-a'}) = (t-k)/(m-i)$.
The expression above is equal to

$$\left(\frac{e^{a'} - 1}{e^a - 1}\right)^{m-i} \left(\frac{a}{a'}\right)^{t-k} \frac{\sigma_a}{\sigma_{a'}} \sqrt{\frac{m}{m-i}}\; \frac{a^k}{(e^a - 1)^i}\, \frac{(t-k)!}{t!}$$

Since i and k are fixed a' tends to a and $\sigma_{a'} \to \sigma_a$, so

$$\frac{\sigma_a}{\sigma_{a'}} \sqrt{\frac{m}{m-i}} \left(\frac{e^{a'} - 1}{e^a - 1}\right)^{-i} \left(\frac{a}{a'}\right)^{-k} \to 1 \qquad (6)$$



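To complete the proof, we will show that

$$\left(\frac{e^{a'}-1}{e^{a}-1}\right)^{m}\left(\frac{a}{a'}\right)^{t} \to 1 \qquad (7)$$

This is enough since it implies

$$P(D) \sim \binom{t}{k}\,k!\;\frac{a^k\,(t-k)!}{(e^{a}-1)^i\,t!}\cdot\frac{b^k\,(t-k)!}{(e^{b}-1)^j\,t!} \qquad (8)$$

$$\sim \frac{a^k b^k}{t^k\,(e^{a}-1)^i\,(e^{b}-1)^j} = t\,\frac{(e^{-b}a)^j\,(e^{-a}b)^i}{m^i n^j\,ab}$$

The last equality uses $a/(e^{a}-1) = te^{-a}/m$ and $b/(e^{b}-1) = te^{-b}/n$, which follow from the definitions of $a$ and $b$, together with $k = i + j - 1$.

Multiplying this by $i^{j-1}j^{i-1}\binom{m}{i}\binom{n}{j}$, the number of vertex-labeled $(i,j)$ trees on $(m,n)$ vertices, and taking limits gives Theorem 3.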
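The algebra in this step can be checked numerically. The following Python sketch, offered only as an illustration, solves $a/(1-e^{-a}) = t/m$ and $b/(1-e^{-b}) = t/n$ by bisection and evaluates the asymptotic expected number of $(i,j)$ trees in two ways: from the unsimplified product (with $\binom{m}{i}\binom{n}{j}$ replaced by $m^i n^j/(i!\,j!)$) and from the simplified form $t\,\frac{i^{j-1}j^{i-1}}{i!\,j!}\frac{(e^{-a}b)^i(e^{-b}a)^j}{ab}$. The function names and the sample sizes are assumptions made for this example, not values taken from the paper.

```python
import math

def f(x: float) -> float:
    """Mean of a Poisson(x) variable conditioned to be at least 1: x / (1 - e^{-x})."""
    return x / -math.expm1(-x)

def solve_parameter(mean: float) -> float:
    """Solve f(x) = mean by bisection. Requires mean > 1, since f increases from 1."""
    if mean <= 1.0:
        raise ValueError("mean degree must exceed 1")
    lo, hi = 1e-9, 2.0 * mean  # f(x) > x, so f(hi) > mean
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tree_count_factor(i: int, j: int) -> float:
    """i^(j-1) j^(i-1) / (i! j!): labeled (i,j) trees divided by i! j!."""
    return i ** (j - 1) * j ** (i - 1) / (math.factorial(i) * math.factorial(j))

def expected_trees_direct(i: int, j: int, m: int, n: int, t: int) -> float:
    """Unsimplified form: tree count times a^k b^k / ((e^a-1)^i (e^b-1)^j t^k)."""
    a, b = solve_parameter(t / m), solve_parameter(t / n)
    k = i + j - 1
    prob = (a * b) ** k / (math.expm1(a) ** i * math.expm1(b) ** j * t ** k)
    return tree_count_factor(i, j) * m ** i * n ** j * prob

def expected_trees_simplified(i: int, j: int, m: int, n: int, t: int) -> float:
    """Simplified form: t * factor * (e^{-a} b)^i (e^{-b} a)^j / (a b)."""
    a, b = solve_parameter(t / m), solve_parameter(t / n)
    return (t * tree_count_factor(i, j)
            * (math.exp(-a) * b) ** i * (math.exp(-b) * a) ** j / (a * b))

if __name__ == "__main__":
    # Hypothetical sizes chosen only to exercise the two formulas.
    for (i, j) in [(1, 1), (1, 2), (2, 3)]:
        print(i, j, expected_trees_direct(i, j, 100, 120, 180),
              expected_trees_simplified(i, j, 100, 120, 180))
```

The two functions should agree up to floating-point error, which is a quick way to confirm that the simplification only uses the defining equations for $a$ and $b$.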
To prove (7) we use the definitions of $a$ and $a'$ to get

$$\left(\frac{e^{a'}-1}{e^{a}-1}\right)^{m}\left(\frac{a}{a'}\right)^{t} = \left(\frac{m-i}{m}\cdot\frac{t}{t-k}\cdot\frac{a'}{a}\right)^{m} e^{(a'-a)m}\left(\frac{a}{a'}\right)^{t} \qquad (9)$$

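To simplify these terms, we compute $a' - a$. Let $f(a) = a/(1-e^{-a})$. The definition of the derivative implies

$$a' - a \approx \frac{f(a') - f(a)}{f'(a)} = \frac{1}{f'(a)}\left(\frac{t-k}{m-i} - \frac{t}{m}\right)$$

The next step is to note

$$\frac{t-k}{m-i} = \frac{t}{m} - \frac{k}{m} + \frac{ti}{m^2} + O\!\left(\frac{1}{m^2}\right) \qquad (10)$$

and conclude that, with $\lambda = \lim t/m$,

$$a' - a \approx \frac{1}{f'(a)}\cdot\frac{\lambda i - k}{m} \qquad (11)$$
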

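Now the first term on the RHS of (9) is

$$\left(\frac{m-i}{m}\right)^{m}\left(\frac{t}{t-k}\right)^{m}\left(\frac{a'}{a}\right)^{m} = \left(1 - \frac{i}{m} + \frac{k}{t-k} + \frac{a'-a}{a} + o(1/m)\right)^{m} \qquad (12)$$

$$= \left(1 + \frac{1}{m}\left(-i + \frac{k}{\lambda} + \frac{\lambda i - k}{a f'(a)}\right) + o(1/m)\right)^{m} \to \exp\left(-i + \frac{k}{\lambda} + \frac{\lambda i - k}{a f'(a)}\right)$$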
if $t/m \to \lambda$. By (11) the second term on the RHS of (9) converges to $\exp((\lambda i - k)/f'(a))$.
For the third term we write

$$\left(\frac{a}{a'}\right)^{t} = \exp\left(-t\log\left(1 + \frac{a'-a}{a}\right)\right)$$

Using (11) and expanding $\log(1+x) = x + O(x^2)$ shows that the third term is

$$\exp\left(\frac{(\lambda i - k)(-\lambda)}{a f'(a)} + O\!\left(\frac{1}{m}\right)\right) \to \exp\left(\frac{(\lambda i - k)(-\lambda)}{a f'(a)}\right)$$

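Adding the three exponents gives

$$(\lambda i - k)\left(-\frac{1}{\lambda} + \frac{1}{f'(a)} + \frac{1-\lambda}{a f'(a)}\right)$$

We want to prove this is 0, so we can ignore the factor in front. Combining the fractions over a common denominator, discarding that denominator, and recalling $\lambda = f(a)$ we have

$$-a f'(a) + (a + 1 - f(a))\,f(a)$$

To check that this is zero, we note that differentiating $f(a) = a/(1-e^{-a})$ gives

$$f'(a) = \frac{1}{1-e^{-a}} - \frac{a e^{-a}}{(1-e^{-a})^2}, \qquad\text{so}\qquad a f'(a) = f(a) - e^{-a} f(a)^2 = f(a)\,(1 + a - f(a)),$$

where the last step uses $e^{-a} = 1 - a/f(a)$, and the proof is complete. ■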
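For readers who want a sanity check of the limit just proved, here is a small Python sketch under stated assumptions. It evaluates the residual $-af'(a) + (a+1-f(a))f(a)$ at a few points and computes the logarithm of $\big((e^{a'}-1)/(e^{a}-1)\big)^m (a/a')^t$ for $t = 2m$, which should approach 0 as $m$ grows. The helper names and the choice $t/m = 2$, $i = k = 1$ are illustrative, not from the paper.

```python
import math

def f(x: float) -> float:
    """f(x) = x / (1 - e^{-x})."""
    return x / -math.expm1(-x)

def f_prime(x: float) -> float:
    """Derivative of f: 1/(1-e^{-x}) - x e^{-x} / (1-e^{-x})^2."""
    q = -math.expm1(-x)  # q = 1 - e^{-x}
    return 1.0 / q - x * math.exp(-x) / q ** 2

def solve_parameter(mean: float) -> float:
    """Solve f(x) = mean by bisection (requires mean > 1)."""
    lo, hi = 1e-9, 2.0 * mean  # f(x) > x, so f(hi) > mean
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def identity_residual(a: float) -> float:
    """-a f'(a) + (a + 1 - f(a)) f(a), which the proof shows is identically 0."""
    return -a * f_prime(a) + (a + 1.0 - f(a)) * f(a)

def log_of_limit(m: int, t: int, i: int, k: int) -> float:
    """log of ((e^{a'}-1)/(e^a-1))^m (a/a')^t, with a and a' defined as in the proof."""
    a = solve_parameter(t / m)
    a_prime = solve_parameter((t - k) / (m - i))
    return (m * (math.log(math.expm1(a_prime)) - math.log(math.expm1(a)))
            + t * (math.log(a) - math.log(a_prime)))

if __name__ == "__main__":
    for m in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5):
        print(m, log_of_limit(m, 2 * m, 1, 1))
```

The printed values shrinking toward 0 are consistent with the cancellation of the three exponents, since the first-order terms in $a' - a$ vanish exactly when $\lambda = f(a)$.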



5 Proof of Theorem 4 

Theorem 4 Define $c$ by $t = c\,\frac{mn}{m+n}\log(m+n)$ and suppose $m/n \to a$, a positive finite limit. The probability that $G_1(m,n,t)$ is connected tends to 0 or 1 depending on whether $c$ has a limit $< 1$ or $> 1$.
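To make the normalization concrete, the following Python sketch inverts the relation $t = c\,\frac{mn}{m+n}\log(m+n)$ to recover $c$ from a triple $(m, n, t)$. The function name is an assumption of this example. Since the theorem is asymptotic, a value of $c$ computed for a small graph is only a rough indication of which regime it resembles.

```python
import math

def connectivity_parameter(m: int, n: int, t: int) -> float:
    """Solve t = c * (m*n/(m+n)) * log(m+n) for c."""
    return t * (m + n) / (m * n * math.log(m + n))

def edges_for_parameter(m: int, n: int, c: float) -> float:
    """Inverse map: the (real-valued) t that corresponds to a given c."""
    return c * m * n / (m + n) * math.log(m + n)

if __name__ == "__main__":
    # Sizes of the form quoted in the figure captions, used here only as sample input.
    print(connectivity_parameter(22, 27, 44))
```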

We can assume without loss of generality that m > n and hence a > 1. 
The first half of the proof is to establish: 

Lemma 4 Under the assumptions of Theorem 4, if $c$ has a limit $< 1$ then the probability that $G_1(m,n,t)$ is connected tends to 0.
Proof. Our first step is to show that the asymptotics in the previous section, which were derived under the assumption that $t$, $m$, and $n$ were all of the same order, continue to hold under the assumptions of Theorem 4. To do this, it suffices to show that (6) and (7) hold. We begin by noting that $t/m \to \infty$ implies $a \to \infty$ and $1 - e^{-a} \to 1$, so $a \sim t/m$. To verify (6) we observe that since $a \to \infty$, $\sigma_a^2/a \to 1$, and $\sigma_a/\sigma_{a'} \to 1$. In addition we will soon see that $a' - a \to 0$, and therefore

$$\left(\frac{e^{a'}-1}{e^{a}-1}\right)^{-i} \sim e^{i(a-a')} \to 1.$$

To prove (7), we begin, as before, by computing $a' - a$. As we have already noted

$$a \sim \frac{t}{m} = \frac{cn}{m+n}\log(m+n) = \frac{c}{1+m/n}\log(m+n) \qquad (14)$$

The fact that $a \sim t/m$ and the definition of $a$ imply that for large $m$

$$\frac{t}{m} > a > \frac{t}{m}\left(1 - e^{-t/2m}\right) > \frac{t}{m}\left(1 - (m+n)^{-\epsilon}\right) \qquad (15)$$



for some $\epsilon > 0$. Since $a \to \infty$, we have $f'(a) \to 1$. Using this with (11) and (10) it follows that

$$a' - a \sim \frac{t-k}{m-i} - \frac{t}{m} = -\frac{k}{m} + \frac{ti}{m^2} + O\!\left(\frac{1}{m^2}\right) \qquad (16)$$

This leads to the asymptotic formula

$$a' - a \sim \frac{c\,i\log(m+n)}{(1+m/n)\,m} - \frac{k}{m} \sim \frac{ai - k}{m} \qquad (17)$$

Now we analyze the first term in the decomposition: using $t/m \sim a$ and (11), (12) becomes

$$\left(1 + \frac{1}{m}\left(-i + \frac{k}{a} + \frac{ai - k}{a f'(a)}\right)\right)^{m} \to \exp\left(-i + \frac{i}{f'(a)}\right) \to 1 \qquad (18)$$

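The second and third terms in (9) are

$$e^{(a'-a)m}\left(\frac{a}{a'}\right)^{t} = \exp\left((a'-a)m - t\log\left(1 + \frac{a'-a}{a}\right)\right)$$

Expanding $\log(1+x) = x + O(x^2)$ the exponent becomes

$$(a'-a)\left(m - \frac{t}{a}\right) - t\,O\!\left(\left(\frac{a'-a}{a}\right)^{2}\right)$$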
(15) implies that the absolute value of the first term is

$$\le (m+n)^{-\epsilon}\cdot\frac{t}{a}\,|a'-a| \to 0$$

by (16) and $a \sim t/m$. To prove that the second term tends to 0, we note that $t/a \sim m$ and use (17) and (14). Thus we have

$$e^{(a'-a)m}\left(\frac{a}{a'}\right)^{t} \to 1 \qquad (19)$$

Combining this with (18) gives (7).

Let $T_1, T_2$ be fixed disjoint trees of size $(i,j)$. Let $A_{i,j}$ be the number of $(i,j)$ trees that are components of our random graph, with $D_T$ indicating whether $T$ is a component. Writing $A_{i,j} = \sum_T D_T$, squaring and taking expected value we have

$$E(A_{i,j}^2) = \binom{m}{i}\binom{n}{j}\,i^{j-1}j^{i-1}\,E(D_{T_1}) + \binom{m}{i}\binom{m-i}{i}\binom{n}{j}\binom{n-j}{j}\left(i^{j-1}j^{i-1}\right)^2 E(D_{T_1}D_{T_2}) \qquad (20)$$

The last term counts the number of disjoint $(i,j)$ trees; overlapping trees contribute nothing to the sum. To calculate $E(D_{T_1}D_{T_2})$, we note that calculations at the beginning of this section have shown

$$\frac{C(m-i,\,n-j,\,t-k)}{C(m,n,t)} \sim \frac{a^k}{(e^{a}-1)^i\,t^k}\cdot\frac{b^k}{(e^{b}-1)^j\,t^k}$$

so we have

$$P(D_{T_1} = 1,\,D_{T_2} = 1) = \binom{t}{2k}(2k)!\,\frac{C(m-2i,\,n-2j,\,t-2k)}{C(m,n,t)} \sim t^{2k}\,\frac{a^{2k}}{(e^{a}-1)^{2i}\,t^{2k}}\cdot\frac{b^{2k}}{(e^{b}-1)^{2j}\,t^{2k}}$$



where the $\binom{t}{2k}(2k)!$ term comes from all the ways of labeling the edges of the trees and dividing the labels between the two trees' edges and the other edges. Recalling (8), we have that $E(D_{T_1}D_{T_2}) \sim E(D_{T_1})^2$ and therefore (20) implies $E(A_{i,j}^2) \sim E(A_{i,j}) + E(A_{i,j})^2$. We wish to show that $E(A_{1,1}) \to \infty$ so that we can conclude $E(A_{1,1}^2) \sim E(A_{1,1})^2$ and apply the second moment method.
To see this, observe that $a < t/m$ and, by symmetry, $b < t/n$. Then the simplified expression for $E(A_{i,j})$ when $i = j = 1$ is bounded as follows

$$E(A_{1,1}) \sim e^{-a}e^{-b}\,t \ge e^{-t/n}e^{-t/m}\,t = (m+n)^{-c}\,t = (m+n)^{-c}\cdot\frac{cmn}{m+n}\log(m+n) \qquad (21)$$

Since $c < 1$ and $m/n$ tends to a positive finite limit, this expression goes to infinity. Now applying the second moment method yields $P(A_{1,1} = 0) \to 0$, which tells us that the probability of the existence of a $(1,1)$ tree goes to 1, and gives the desired result. ■
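The growth of the lower bound in (21) is easy to tabulate. The Python sketch below, with an illustrative function name and the simplifying assumption $m = n$, evaluates $(m+n)^{-c}\cdot\frac{cmn}{m+n}\log(m+n)$ for a fixed $c < 1$ and increasing sizes. It is meant only to show the polynomial growth of order $(m+n)^{1-c}$ up to a logarithmic factor.

```python
import math

def isolated_edge_lower_bound(m: int, n: int, c: float) -> float:
    """Right-hand side of (21): (m+n)^(-c) * t with t = c*m*n/(m+n)*log(m+n)."""
    t = c * m * n / (m + n) * math.log(m + n)
    return (m + n) ** (-c) * t

if __name__ == "__main__":
    # c = 0.9 is an arbitrary value below the threshold, chosen for illustration.
    for size in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
        print(size, isolated_edge_lower_bound(size, size, 0.9))
```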

Before tackling the other direction we need a preliminary result.

Lemma 5 Let $Z$ have truncated Poisson distribution with mean $\lambda/(1-e^{-\lambda})$. Then

$$P(Z < \lambda/2) \le \exp(-0.15\lambda) \qquad (22)$$

If $L > 1/\ln 2$ then

$$P(Z > L\lambda) \le \frac{1}{1-e^{-\lambda}}\exp(\lambda - L\lambda\ln 2) \qquad (23)$$


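Proof. Let $Z'$ have the Poisson distribution with mean $\lambda$. The moment generating function is $Ee^{\theta Z'} = \exp(\lambda(e^{\theta}-1))$, so if $\theta < 0$

$$e^{\theta\lambda/2}\,P(Z' < \lambda/2) \le \exp(\lambda(e^{\theta}-1)).$$

Taking $\theta = -\ln 2$

$$P(Z' < \lambda/2) \le \exp\left(-\frac{\lambda}{2}(1-\ln 2)\right)$$

Since $\ln 2 < 0.7$ and $P(Z < \lambda/2) \le P(Z' < \lambda/2)$ the first result follows. For the second we note that if $\theta > 0$

$$e^{\theta L\lambda}\,P(Z' > L\lambda) \le \exp(\lambda(e^{\theta}-1))$$

Take $\theta = \ln 2$ and note that since $L > 0$ we have

$$P(Z > L\lambda) = \frac{1}{1-e^{-\lambda}}\,P(Z' > L\lambda) \le \frac{1}{1-e^{-\lambda}}\exp(\lambda - L\lambda\ln 2),$$

the desired result. ■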
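The two tail bounds can be compared with exact probabilities. The Python sketch below computes the truncated Poisson probability mass function directly and checks (22) and (23) for a few parameter values. The function names are assumptions of this example, and the upper tail is summed over a finite window, so it very slightly underestimates the true tail.

```python
import math

def truncated_poisson_pmf(k: int, lam: float) -> float:
    """P(Z = k) for k >= 1, where Z is Poisson(lam) conditioned on Z >= 1."""
    log_poisson = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return math.exp(log_poisson) / -math.expm1(-lam)

def lower_tail(lam: float) -> float:
    """P(Z < lam/2): exact sum over the integers 1 <= k < lam/2."""
    return sum(truncated_poisson_pmf(k, lam) for k in range(1, math.ceil(lam / 2)))

def upper_tail(lam: float, L: float, terms: int = 400) -> float:
    """P(Z > L*lam), summed over the first `terms` integers above L*lam."""
    start = math.floor(L * lam) + 1
    return sum(truncated_poisson_pmf(k, lam) for k in range(start, start + terms))

def bound_22(lam: float) -> float:
    """Right-hand side of (22)."""
    return math.exp(-0.15 * lam)

def bound_23(lam: float, L: float) -> float:
    """Right-hand side of (23); the lemma assumes L > 1/ln 2."""
    return math.exp(lam - L * lam * math.log(2)) / -math.expm1(-lam)

if __name__ == "__main__":
    for lam in (1.0, 5.0, 20.0):
        print(lam, lower_tail(lam), bound_22(lam), upper_tail(lam, 2.0), bound_23(lam, 2.0))
```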
Lemma 6 Under the assumptions of Theorem 4, if $c$ has a limit $> 1$ then the probability that $G_1(m,n,t)$ is connected tends to 1.

Proof. Under the assumptions of Theorem 4, $a \sim (cn/(m+n))\log(m+n)$ and $b \sim (cm/(m+n))\log(m+n)$. Let $r = \lim cn/(m+n)$ and $s = \lim cm/(m+n)$. Without loss of generality $s \ge r$, i.e., $m \ge n$. Our first step is to get an upper bound on the maximum degree of a vertex, $D$. By (23) with $\lambda = s\log(m+n)$

$$P(D > Ls\log(m+n)) \le \frac{\exp\big(s\log(m+n)(1 - L\ln 2)\big)}{1-(m+n)^{-s}} = \frac{(m+n)^{s - sL\ln 2}}{1-(m+n)^{-s}}$$

Taking $L = (2+s)/(s\ln 2)$ the right-hand side is $\le 2(m+n)^{-2}$ for sufficiently large $m+n$. Assume for the rest of the proof that $D \le L\log(m+n)$.

The number of vertices in the first four generations is at most $N = \sum_{i=1}^{4}(L\log(m+n))^i$. We will show that with high probability the cluster grown in these generations has size at least of order $(\log(m+n))^2$ and this cluster will connect up to all others. Using the trivial inequality $t \ge \max\{m,n\} \ge (m+n)/2$, the probability that two edges pick the same vertex in the first four generations (call this a collision, as before) is

$$\le N^2\,\frac{D}{t} \le N^2\,\frac{2L\log(m+n)}{m+n}$$

This is too big to ignore but the probability of two or more collisions is

$$\le \left(N^2\,\frac{2L\log(m+n)}{m+n}\right)^{2} \le C\,\frac{(\log(m+n))^{18}}{(m+n)^2}$$

so with high probability there is at most one collision in the first four generations of the cluster containing any vertex.

Our assumptions imply $r + s > 1$, so we can pick $r' < r$ and $s' < s$ with $r' < s'$ and $r' + s' \in (1,2)$. Pick $K$ so that $Kr'(0.15) > 2$. If $a > \ln 2$ (which will be true for large $m$), then in the associated branching process ($Z_i$ = the number of vertices in generation $i$)

$$P(Z_1 < K+1) = \frac{1}{1-e^{-a}}\sum_{k=1}^{K} e^{-a}\frac{a^k}{k!} \le 2(K+1)\,e^{-a}a^{K}$$

$a \sim r\log(m+n)$ so if $m$ is large

$$P(Z_1 < K+1) \le (n+m)^{-r'}$$

By similar reasoning if $m$ is large

$$P(Z_2 < K+1 \mid Z_1 = j) \le (n+m)^{-s'} \quad\text{for all } j \ge 1$$
From this it follows that

$$P(\max\{Z_1, Z_2\} < K+1) \le (m+n)^{-(r'+s')} \qquad (24)$$

So with high probability $Z_1$ or $Z_2$ is large and this implies $Z_4$ is large with high probability. For $2 \le i \le 3$ divide individuals in generation $i$ into groups of size $K$. Since the sum of independent Poisson distributions is Poisson and the truncated Poisson distribution dominates the Poisson distribution, we may apply (22) to each group of size $K$.

$$P(\text{children of group} < Kr'\log(m+n)/2) \le \exp(-0.15\,Kr'\log(m+n)) \le \frac{1}{(m+n)^2}$$

Trivially, the number of groups in generation $i$ is $\le (L\log(m+n))^i$ so

$$P\left(Z_{i+1} < \frac{Z_i}{K}\cdot\frac{Kr'\log(n+m)}{2} \,\middle|\, Z_i > K\right) \le \frac{(L\log(m+n))^3}{(m+n)^2}$$

Using this with (24) we can conclude that there is a constant $\delta > 0$ so that for large $m$

$$P(Z_4 < \delta(\log(m+n))^2) \le 2(m+n)^{-(r'+s')}$$

This shows that with high probability all clusters have size at least $\delta(\log(m+n))^2$. It follows from the proof of Theorem 2 that with high probability all clusters will grow to size $\delta n$ and connect. For readers who may be concerned with how the constants in that proof depend on $a$ and $b$ we note that all we need is a lower bound on the growth so for this phase of the argument, we can fix $a' < a$ and $b' < b$ with $a'b' > 1$. Theorem 2 does not apply when $a$ and $b$ are $O(\log(m+n))$, but all we need is a lower bound, so it suffices to apply Theorem 2 with $a'$ and $b'$. ■



6 Appendix 

Proof. We will prove that the models $G_1(m,n,t)$ and $TP(m,n,t)$ are the same by looking at the distributions they induce on the set of edge-labeled multigraphs. To do this, we will have to augment the model descriptions to label the edges. If we pick edges with replacement and label the edges in the order drawn then the set of outcomes $\Omega$, written as vectors of edges, has $(mn)^t$ elements and $G_r(m,n,t)$ is uniform over the subset $\Omega_0$ in which each vertex has degree at least one.
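One simple way to draw from the uniform distribution on $\Omega_0$ described above is rejection sampling: draw $t$ edges with replacement and keep the outcome only if every vertex is covered. The Python sketch below is a minimal illustration of that idea, not the construction used in the paper. The function name, the `max_tries` cutoff, and the zero-based vertex labels are assumptions of this example, and the method is practical only when the acceptance probability is not too small.

```python
import random

def sample_min_degree_one(m, n, t, rng, max_tries=100000):
    """Draw t labeled edges uniformly with replacement from the m*n pairs and
    accept the draw only if every left and right vertex has degree >= 1.
    Accepted outcomes are uniform on the subset Omega_0 of the (mn)^t outcomes."""
    if t < max(m, n):
        raise ValueError("need t >= max(m, n) for every vertex to be covered")
    for _ in range(max_tries):
        # edges[l] is the edge carrying label l (the order drawn)
        edges = [(rng.randrange(m), rng.randrange(n)) for _ in range(t)]
        left_covered = len({u for u, _ in edges}) == m
        right_covered = len({v for _, v in edges}) == n
        if left_covered and right_covered:
            return edges
    raise RuntimeError("no accepted sample within max_tries")

if __name__ == "__main__":
    print(sample_min_degree_one(3, 4, 8, random.Random(0)))
```

Because every outcome in $\Omega$ is equally likely before the acceptance test, conditioning on acceptance gives the uniform distribution on $\Omega_0$.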

To label edges in $TP(m,n,t)$, first generate $L'$ and $R'$, the duplicated sets of vertices. Attach to the elements of $L'$ numbers chosen at random from $\{1, 2, \ldots, t\}$ and call these edge-labels. Do the same independently for $R'$. Connect the element edge-labeled $i$ in $L'$ and the element edge-labeled $i$ in $R'$, and label this edge $i$.

Consider an outcome $\omega_0 \in \Omega_0$ with degrees $i_1, \ldots, i_m$ on the left and $j_1, \ldots, j_n$ on the right. By calculations in the introduction, the probability that a graph in TP will have the same degrees as $\omega_0$ is

$$\frac{t!}{i_1!\,i_2!\cdots i_m!}\cdot\frac{t!}{j_1!\,j_2!\cdots j_n!}\;\Big/\;S(m,t)\,S(n,t)$$

where $S(m,t)$ and $S(n,t)$ are normalizing constants that make the sum 1.
Now $\omega_0$'s edge labels determine the edge labels incident to each vertex. For each left vertex $i$, let $E_i^L$ be the set of edge labels incident to $i$ in $\omega_0$; similarly, let $E_j^R$ be the set of edge labels incident to right vertex $j$. In order for TP to generate $\omega_0$, for each left vertex $i$, the labels of the set of vertices in $L'$ that collapse to $i$ must be $E_i^L$ (but the order of the labels among the collapsing vertices doesn't matter). A similar statement holds for the right vertices. The probability that vertices are labeled as described is

$$\frac{i_1!\,i_2!\cdots i_m!}{t!}\cdot\frac{j_1!\,j_2!\cdots j_n!}{t!}$$

The product of this with the probability of the degree sequence is $1/(S(m,t)S(n,t))$, which does not depend on $\omega_0$, so the edge-labeled graphs generated by $TP(m,n,t)$ are also uniform on $\Omega_0$.
Figure 1. Comparison of elephant and human genomes. Data from Yang et al. (2003). m = 22, n = 27, t = 44, a = 1.126, b = 1.654, ab = 1.863.

Figure 2. Comparison of human and colobine monkey (Colobus guereza) genomes. Data from Bigoni et al. (1997). m = 22, n = 21, t = 28, a = 0.581, b = 0.685, ab = 0.397.

Figure 3. Comparison of human and cat genomes. Data from Weinberg et al. (1997) and Murphy et al. (1999). m = 22, n = 19, t = 32, a = 1.151, b = 0.802, ab = 0.925.

Figure 4. Comparison of the human and dog genomes. Data from Breen et al. (1999). m = 22, n = 38, t = 67, a = 2.873, b = 1.477, ab = 4.245.

Figure 5. Comparison of lemur (Eulemur macao macao) and human genomes. Data from Müller et al. (1997). m = 20, n = 22, t = 38, a = 1.458, b = 1.214, ab = 1.771.
[Table 1 body not recovered. Surviving labels: Example 1, 2, 3, 4, 5; human compared with elephant, monkey, cat, dog, lemur; parameters a, b.]
Table 1. Expected number of trees of various sizes compared with the number observed in our five examples.